Abstract: We present new algorithms for inverse reinforcement
learning (IRL, or inverse optimal control) in convex
optimization settings. We argue that finite-space IRL can be 
posed as a convex quadratic program under a Bayesian infer- 
ence framework with the objective of maximum a posteriori 
estimation. To deal with problems in large or even infinite state 
space, we propose a Gaussian process model and use preference 
graphs to represent observations of decision trajectories. Our 
method is distinguished from other approaches to IRL in 
that it makes no assumptions about the form of the reward 
function and yet it retains the promise of computationally 
manageable implementations for potential real-world applications.
In comparison with an established algorithm on small-scale
numerical problems, our method demonstrated better accuracy 
in apprenticeship learning and a more robust dependence on 
the number of observations. 

I. Introduction 

Imitation learning is a subfield of machine learning in 
which the objective is to learn to mimic human behavior 
solely through observation of the actions taken by the subject. 
Technical approaches to imitation learning generally fall into 
two broad categories [1]. One category contains behavioral 
cloning approaches that attempt to use supervised learning to 
predict actions directly from observations of features of the 
environment. The other category consists of IRL approaches which,
first introduced in [2], use training examples in the form of
decision trajectories defined in terms of a Markov decision 
process (MDP) model of the underlying sequential decision 
task. IRL algorithms attempt to discover the reward function 
for the MDP solely on the basis of observations of a decision- 
maker's solution to that problem. This approach is appealing 
because knowledge of the reward function offers the promise 
that behavior can be predicted in domains unseen during the 
period of observation. 

A variety of approaches have been proposed for IRL. In 
early work, Ng and Russell [2] advance the key idea of
choosing the reward function to maximize the difference 
between the optimal and suboptimal policies, under the as- 
sumption that the reward function can be approximated by a 
linear combination of basis functions. A principal motivation 
for considering IRL problems is the idea of apprenticeship 
learning, in which observations of state-action pairs are used 
to learn the policies followed by experts for the purpose of 
mimicking or cloning behavior. By their nature, apprenticeship
learning problems arise in situations where it is not possible
or desirable to observe all state-action pairs for the decision
maker's policy. In recent approaches to apprenticeship
learning, partial policy observation is dealt with by searching
for mixed solutions in a space of learned policies with the goal
that the cumulative feature expectation is near that of the
expert [3], [4]. In such approaches, the reward function is
approximated by a linear combination of features, which 
in turn allows for linear approximation of value functions 
with consequent simplification of the learning problem. In 
such methods, algorithm performance is strongly influenced 
by the modeler's choice of features. Another algorithm for
IRL is policy matching, in which a loss function penalizing
deviations from the expert's policy is minimized by tuning the
parameters of the reward function [5].

The assumption that the reward function can be linearly 
approximated, which underlies a number of IRL approaches, 
may not be reasonable for many problems of practical 
interest. The ill-posed nature of the inverse learning problem 
also presents difficulties. Multiple reward functions may 
yield the same optimal policy, and there may be multiple 
observations at a state given the true reward function. To 
deal with these problems, we design algorithms that do 
not assume a linear structure for the reward function, yet
remain computationally efficient. In particular, we propose 
new IRL models and algorithms that assign a Gaussian prior 
on the reward function or treat the reward function as a 
Gaussian process. This approach is similar in perspective to 
that of Ramachandran and Amir [6], who view the state-action
samples from the expert as the evidence that will be used 
to update a prior on the reward function, under a Bayesian 
framework. Other approaches to IRL include game-theoretic 
methods [7] and algorithms derived from linearly-solvable 
stochastic optimal control [8]. 

The main contributions of our work are as follows. First, 
we model the reward function in a finite state space using a 
Bayesian framework with known Gaussian priors. We show 
that this problem is a convex quadratic program, and hence 
that it can be efficiently solved. Second, for the general 
case that allows noisy observation of incomplete policies, 
representation of the reward function is challenging and 
requires more computation. We show that a Gaussian process model
is appropriate in that case. Our model constructs a preference
graph in action space to represent the multiple observations 
at a state. Even in cases where the state space is much larger
than the number of observations, IRL via Gaussian processes
has the promise of offering robust predictions and results that
are relatively insensitive to the number of observations.

It is worth mentioning here that the preference graph we 
use in IRL is based on an understanding of the agent's 
preferences over action space. In the machine learning lit- 
erature, there has been study of a learning scenario called 
learning label preference that focuses on finding the latent 
function that predicts preference relations among a finite set 
of labels. This scenario is a generalization of some stan- 
dard problems, such as classification and label ranking [9]. 
Considering the latent function values as a Gaussian process, 
Chu and Ghahramani [10] observed that a Bayesian framework
is an efficient and competitive method for learning label 
preferences, and they proposed a novel likelihood function 
to capture preference relations and the use of a Gaussian 
process model for learning label preferences. We also use 
Bayesian inference and build off several of the ideas in 
[10] and related work, but our method differs from label 
preference learning for classification and label ranking. Our 
input data depends on states and actions in the context of an 
MDP. Moreover, we are learning the reward that indirectly 
determines how actions are chosen during the sequential 
evolution of an MDP, while preference learning studies the 
latent functions preserving preferences. 

The rest of this paper is organized as follows: In Section
II we introduce IRL preliminaries. In Sections III and IV
we propose our principal models and algorithms. In Section
V we describe the results of two small-scale numerical
experiments. Finally, in Section VI we offer some concluding
remarks.

II. Preliminaries 

A finite-state, infinite-horizon Markov decision process
(MDP) is defined as a tuple $M = (S, A, \mathcal{P}, \gamma, r)$, where
$S = \{s_1, s_2, \ldots, s_n\}$ is a set of $n$ states;
$A = \{a_1, a_2, \ldots, a_m\}$ is a set of $m$ actions;
$\mathcal{P} = \{P_{a_j}\}_{j=1}^{m}$ is a set of state transition
probabilities; $\gamma$ is a discount factor; and $r$ is the reward
function, which can be written as $r(s, a)$ if we define it as
depending on state $s$ and action $a$. For any $a \in A$, $P_a$ is
an $n \times n$ matrix, each row of which, denoted as $P_{as}$, contains
the transition probabilities upon taking action $a$ in state $s$.

Consider a decision maker who selects actions according
to a policy $\pi : S \to A$ that maps states to actions. Define
the value function at state $s$ with respect to policy $\pi$ to be
$V^{\pi}(s) = E[\sum_{t=0}^{\infty} \gamma^t r(s^t, \pi(s^t)) \mid \pi]$,
where the expectation is over the distribution of the state
sequence $\{s^0, s^1, \ldots\}$ given policy $\pi$, and superscripts
index time. A decision maker who aims to maximize expected
reward will, at every state $s$, choose the action that maximizes
$V^{\pi}(s)$. Similarly, define the Q-factor for state $s$ and action
$a$ under policy $\pi$, $Q^{\pi}(s, a)$, to be the expected return from
state $s$, taking action $a$ and thereafter following policy $\pi$.
Given a policy $\pi$, $\forall s \in S, a \in A$, $V^{\pi}(s)$ and
$Q^{\pi}(s, a)$ satisfy

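$$V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'} P_{\pi(s)s}(s')\, V^{\pi}(s'),$$

$$Q^{\pi}(s, a) = r(s, a) + \gamma \sum_{s'} P_{as}(s')\, V^{\pi}(s').$$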



The well-known Bellman optimality conditions state that
$\pi$ is optimal if and only if, $\forall s \in S$, we have
$\pi(s) \in \arg\max_{a \in A} Q^{\pi}(s, a)$ [11].
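The two Bellman relations and the optimality condition can be checked numerically on a small MDP. The following is a minimal sketch, assuming transition matrices stored as a NumPy array `P` of shape `(m, n, n)` and rewards as an array `r[s, a]`; the function names and array layout are illustrative choices, not part of the original formulation.

```python
import numpy as np

def evaluate_policy(P, r, policy, gamma):
    """Return (V, Q) for a deterministic policy.

    P[a] is the n x n transition matrix of action a, r[s, a] is the reward,
    and policy[s] is the action chosen in state s.
    """
    n = r.shape[0]
    states = np.arange(n)
    P_pi = P[policy, states]      # row s equals P[policy[s]][s, :]
    r_pi = r[states, policy]
    # V = r_pi + gamma * P_pi V  is the linear system (I - gamma P_pi) V = r_pi
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    # Q(s, a) = r(s, a) + gamma * sum_{s'} P_a(s, s') V(s')
    Q = r + gamma * np.einsum("asn,n->sa", P, V)
    return V, Q

def is_optimal(Q, policy, tol=1e-9):
    """Bellman check: policy[s] attains max_a Q(s, a) in every state."""
    chosen = Q[np.arange(len(policy)), policy]
    return bool(np.all(chosen >= Q.max(axis=1) - tol))
```

Solving the linear system directly is practical only when the state space is small enough to hold `P` in memory, which matches the finite-state setting of this section.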

Given an MDP $M = (S, A, \mathcal{P}, \gamma, r)$, let us define
the inverse Markov decision process (IMDP)
$M_I = (S, A, \mathcal{P}, \gamma, O)$. The process $M_I$ includes the
states, actions, and dynamics of $M$, but lacks a specification of
the reward vector, $r$. By way of compensation, $M_I$ includes a set of
observations $O$ that consists of state-action pairs generated
through the observation of a decision maker. We can define
the inverse reinforcement learning (IRL) problem associated
with $M_I = (S, A, \mathcal{P}, \gamma, O)$ to be that of finding the reward
function $r$ such that the observations $O$ could have come
from an optimal policy for $M = (S, A, \mathcal{P}, \gamma, r)$. The IRL
problem is, in general, highly underspecified, which has
led researchers to consider various models for restricting
the set of reward vectors under consideration. In a seminal
consideration of IMDPs and associated IRL problems, Ng
and Russell [2] observe that, by the optimality equations, the
only reward vectors consistent with an optimal policy $\pi$ are
those that satisfy the set of inequalities

$$(P_{\pi} - P_a)(I_n - \gamma P_{\pi})^{-1} r \geq 0, \quad \forall a \in A, \tag{1}$$

where $P_{\pi}$ is the transition probability matrix relating to the
observed policy $\pi$, $P_a$ denotes the transition probability
matrix for other actions, $I_n$ is an $n \times n$ identity matrix, and
$r$ is a reward vector that depends only on state. Note that
the trivial solution $r = 0$ satisfies these constraints, which
highlights the underspecified nature of the problem and
the need for reward selection mechanisms. Ng and Russell
[2] choose the reward function to maximize the difference
between the optimal and suboptimal policies, which can be
done using a linear programming formulation. In the sections
that follow, we propose the idea of selecting the reward on
the basis of maximum a posteriori (MAP) estimation in a
Bayesian framework.
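As an illustration of the inequalities in Eq. (1), the sketch below tests whether a candidate state-reward vector is consistent with an observed deterministic policy. It assumes the same hypothetical array layout as elsewhere in these sketches (`P` of shape `(m, n, n)`, `policy[s]` an action index); the function name is an illustrative choice.

```python
import numpy as np

def reward_is_consistent(P, policy, r, gamma, tol=1e-9):
    """Check (P_pi - P_a)(I - gamma P_pi)^{-1} r >= 0 for every action a."""
    m, n, _ = P.shape
    P_pi = P[policy, np.arange(n)]
    # v = (I - gamma P_pi)^{-1} r, computed without forming the inverse
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
    return all(np.all((P_pi - P[a]) @ v >= -tol) for a in range(m))
```

Calling it with `r = np.zeros(n)` always returns `True`, which is the trivial solution noted above and the reason a selection criterion is needed.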

III. Bayesian IRL with Gaussian Distribution 

Suppose that we have a prior distribution $p(r)$ for the
rewards in an IMDP $M_I$, along with a likelihood function
$p(O \mid r)$. Then we can define the associated Bayesian IRL
problem to be that of finding the MAP estimate of $r$. In this
section we consider this problem for priors with a Gaussian 
distribution, showing that the MAP estimation problem can 
be formulated as a convex optimization problem. We assume 
all the states, value functions, and transition probabilities can 
be stored in the memory of a computer. 

Specifically, let $r \in \mathbb{R}^n$ be a random vector depending only
on state. The entry $r(s_i)$ denotes the reward at the $i$-th state.
We assign a Gaussian prior on $r$: $r \sim N(\mu_r, \Sigma_r)$. This
is a subjective distribution; before anything is known about
optimal policies for the MDP, the learner has characterized
a prior belief by $\mu_r$, with confidence expressed by $\Sigma_r$.

One can envision two principal types of experiments for
collecting a set of observations $O$:

1) Decision Mapping: the observations are obtained by
finding a mapping between state and action; e.g., we
ask the expert which action he, she, or it would choose
at state $s$, and then repeat the process. Ultimately, we
will have a set of independent state-action pairs,
$O_1 = \{(s^h, a^h)\}_{h=1}^{t}$.

2) Decision Trajectory: Given an initial state, we simulate
the decision problem and record the history of the
expert's behavior, $O_2 = \{s^1, a^1, s^2, a^2, \ldots, s^t, a^t\}$.

Fig. 1. An example showing the Bayesian IRL given full observation of
the decision maker's policy.

Formally, we define an experiment $E$ to be a triple
$(O, r, \{p(O \mid r)\})$, where $O$ is a random vector with probability
mass function $p(O \mid r)$ for some $r$ in the function space.
Given the experiment $E$ that was performed and a particular
observation of $O$, the experimenter is able to make inferences
and draw some evidence about $r$ arising from $E$ and $O$. We
denote this evidence by $Ev(E, O)$. Consider observations
made using decision mapping, $O_1$, and decision trajectory, $O_2$,
with corresponding experiments $E_1 = (O_1, r, \{p(O_1 \mid r)\})$
and $E_2 = (O_2, r, \{p(O_2 \mid r)\})$. We would like to show that
$Ev(E_1, O_1) = Ev(E_2, O_2)$ if the states in $O_1$ and $O_2$ are
the same. This fact implies that inference conclusions drawn
from $O_1$ and $O_2$ should be identical.

Making use of the independence of state-action pairs in decision
mapping, we calculate the joint probability density as

$$p(O_1 \mid r) = \prod_{h=1}^{t} p(s^h, a^h \mid r) = \prod_{h=1}^{t} p(s^h)\, p(a^h \mid s^h, r).$$

Considering the Markov transitions in a decision trajectory, we
write the joint probability density as

$$p(O_2 \mid r) = p(s^1)\, p(a^1 \mid s^1, r) \prod_{h=2}^{t} p(s^h \mid s^{h-1}, a^{h-1})\, p(a^h \mid s^h, r).$$

Finally, we get $p(O_1 \mid r) = c(O_1, O_2)\, p(O_2 \mid r)$, where
$c(O_1, O_2)$ is a constant. The above equation implies an
equivalence of evidence for inference of $r$ between the use
of a decision map or a decision trajectory.
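To make the constant explicit, one can divide the two joint densities above: the action terms $p(a^h \mid s^h, r)$ cancel, leaving a ratio of state probabilities only. The following worked step assumes that every transition probability along the observed trajectory is nonzero, so that the division is well defined.

```latex
c(O_1, O_2)
  = \frac{p(O_1 \mid r)}{p(O_2 \mid r)}
  = \frac{\prod_{h=1}^{t} p(s^h)}
         {p(s^1) \prod_{h=2}^{t} p(s^h \mid s^{h-1}, a^{h-1})}
  = \prod_{h=2}^{t} \frac{p(s^h)}{p(s^h \mid s^{h-1}, a^{h-1})}
```

No factor on the right-hand side depends on $r$, so the two likelihoods are proportional as functions of $r$.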

To simplify computation, we eliminate the elements in the
likelihood function $p(O \mid r)$ that do not contain $r$, which yields
$p(O \mid r) = \prod_{h=1}^{t} p(a^h \mid s^h, r)$. Further, we model
$p(a^h \mid s^h, r)$ by

$$p(a^h \mid s^h, r) = \begin{cases} 1, & \text{if } Q(s^h, a^h) \geq Q(s^h, a), \ \forall a \in A \\ 0, & \text{otherwise.} \end{cases}$$

This form for the likelihood function is based on the assumption
that each observed action is an optimal choice on the part
of the expert. Note that the set of reward values that make
$p(a^h \mid s^h, r)$ equal to one is given by Eq. (1).

Proposition 1: Assume a countable state and control
space and a stationary policy. Then IRL using Bayesian MAP
inference is a convex quadratic programming problem.

Proof: By Bayes' rule, the posterior distribution of the
reward satisfies

$$p(r \mid O) \propto p(O \mid r)\, \frac{1}{(2\pi)^{n/2} |\Sigma_r|^{1/2}} \exp\left(-\frac{1}{2}(r - \mu_r)^T \Sigma_r^{-1} (r - \mu_r)\right). \tag{2}$$

This posterior probability $p(r \mid O)$ quantifies the evidence that
$r$ is the reward for the observations in $O$. Using Eq. (1), we
formulate the IRL problem as

$$\begin{aligned} \min_{r} \quad & \frac{1}{2}(r - \mu_r)^T \Sigma_r^{-1} (r - \mu_r) \\ \text{s.t.} \quad & (P_{\pi} - P_a)(I_n - \gamma P_{\pi})^{-1} r \geq 0, \quad \forall a \in A \\ & r_{\min} \leq r \leq r_{\max} \end{aligned} \tag{3}$$

Since the objective is convex quadratic and the constraints are
affine, Problem (3) is a convex quadratic program. ■
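For a small state space, the program in Problem (3) can be handed to a general-purpose constrained solver. The sketch below uses SciPy's SLSQP method as a stand-in for a dedicated QP solver; that choice, the function name `map_reward`, and the array layout (`P` of shape `(m, n, n)`) are assumptions made for illustration, not the implementation used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize

def map_reward(P, policy, mu, Sigma, gamma, r_min, r_max):
    """MAP reward under a Gaussian prior subject to the constraints of Eq. (1)."""
    m, n, _ = P.shape
    P_pi = P[policy, np.arange(n)]
    inv = np.linalg.inv(np.eye(n) - gamma * P_pi)
    # Stack the rows of (P_pi - P_a)(I - gamma P_pi)^{-1} over all actions.
    G = np.vstack([(P_pi - P[a]) @ inv for a in range(m)])
    G = G[np.abs(G).sum(axis=1) > 1e-12]  # drop trivially satisfied zero rows
    Sigma_inv = np.linalg.inv(Sigma)

    def objective(r):
        d = r - mu
        return 0.5 * d @ Sigma_inv @ d

    def gradient(r):
        return Sigma_inv @ (r - mu)

    result = minimize(
        objective,
        x0=np.clip(np.zeros(n), r_min, r_max),
        jac=gradient,
        method="SLSQP",
        bounds=[(r_min, r_max)] * n,
        constraints=[{"type": "ineq", "fun": lambda r: G @ r, "jac": lambda r: G}],
    )
    return result.x
```

When the prior mean already satisfies the constraints, the solver returns the prior mean itself; otherwise the estimate is pulled onto the boundary of the feasible set, which corresponds to the shift in mode of the truncated posterior.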

Fig. 1 shows a Gaussian prior on reward and its posterior
after truncation by accounting for the linear constraints on
reward implied by observation $O$. Note the shift in mode.

The development above assumes the availability of a
complete set of observations, giving the optimal action at
every state. If necessary, it may be possible to expand
observations of partial policies to fit the framework. A naive
approach would be to use state transition probabilities averaged
over all possible actions at unobserved states.
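The naive expansion of a partial policy can be sketched as follows. The code assumes a uniform average over actions at unobserved states and a dictionary `observed` mapping state indices to observed action indices; both are illustrative assumptions.

```python
import numpy as np

def policy_transition_matrix(P, observed):
    """Build P_pi from a partial policy.

    P has shape (m, n, n); `observed` maps a state index to the action
    the expert took there. Rows of unobserved states are the uniform
    average of that state's rows over all m actions.
    """
    P_pi = P.mean(axis=0)          # default: average over actions
    for s, a in observed.items():
        P_pi[s] = P[a, s]          # observed states use the expert's action
    return P_pi
```

The resulting matrix is still row-stochastic, so it can be used in place of $P_{\pi}$ in the constraints of Problem (3).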

IV. Gaussian processes for generalized IRL 

In this section, we introduce a Gaussian process IRL 
model. Our model involves the construction of a preference 
graph, defined below, that is used to record the actions of 
the expert under observation. The choice of one action over 
the others at any given state will be governed by Q-function 
values, if the expert acts optimally. Hence, these values may 
be used to define preference relations among actions. 

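Definition 1 At state $s_i \in S$, $\forall \hat{a}, a \in A$, we define the
preference relation as follows: if $Q(s_i, a) \geq Q(s_i, \hat{a})$, the
action $a$ is weakly preferred to $\hat{a}$, denoted as
$a \succeq_{s_i} \hat{a}$; $a$ is strictly preferred to $\hat{a}$,
denoted as $a \succ_{s_i} \hat{a}$, if and only if
$Q(s_i, a) > Q(s_i, \hat{a})$; $a$ is equivalent to $\hat{a}$, denoted as
$a \sim_{s_i} \hat{a}$, if and only if $a \succeq_{s_i} \hat{a}$
and $\hat{a} \succeq_{s_i} a$.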
Definition 2 A preference graph over action space is
a directed graph showing preference relations among the
countable actions at a given state. At state $s_i$, a preference
graph $G_i$ consists of the node set $V_i$ and edge set $E_i$. Each
node represents an action in $A$. Define a one-to-one mapping
$\psi : V_i \to A$. Each edge indicates the preference relation
between two nodes.
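One possible in-memory representation of such a graph is sketched below for the two-layer case used in this section. It assumes that every action observed at a state is treated as optimal and that nodes are identified directly with their actions; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceGraph:
    """Two-layer preference graph over actions at one state."""
    top: list                                        # actions observed as optimal
    bottom: list                                     # all remaining actions
    strict: list = field(default_factory=list)       # (u, v): u strictly preferred to v
    equivalent: list = field(default_factory=list)   # (u, v): u equivalent to v

def build_graph(actions, observed_actions):
    """Observed actions form the top layer; the others form the bottom layer."""
    seen = set(observed_actions)
    top = [a for a in actions if a in seen]
    bottom = [a for a in actions if a not in seen]
    # One directed edge from each top node to each bottom node.
    strict = [(u, v) for u in top for v in bottom]
    # One undirected edge for each unordered pair of top nodes.
    equivalent = [(u, v) for i, u in enumerate(top) for v in top[i + 1:]]
    return PreferenceGraph(top, bottom, strict, equivalent)
```

With this layout, `len(graph.strict)` and `len(graph.equivalent)` play the roles of the per-state counts of strict and equivalent edges.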

Suppose we are given a dataset of observations, denoted
as $O = \{S, \mathcal{G}\} = \{(s_i, G_i)\}_{i=1}^{n}$. Each pair $(s_i, G_i)$
consists of two components: one is the input $s_i$, a feature
vector constructed by a mapping $\phi : S \to [0, 1]^d$; the other,
denoted as $G_i = (V_i, E_i)$, is a two-layer preference graph over
actions observed at $s_i$. As shown in Fig. 2, the node set
$V_i$ can be divided into two subsets: a set of nodes in the top
layer to represent optimal actions, denoted as $V_i^+$; and a set of
nodes in the bottom layer to represent other actions, denoted
as $V_i^-$. The edge set is
$E_i = \{(u \to v)_l : u \in V_i^+, v \in V_i^-\}_{l=1}^{n_i} \cup \{(u \leftrightarrow v)_k : u, v \in V_i^+\}_{k=1}^{m_i}$,
where $n_i$ is the number of edges denoting strict preference
relations and $m_i$ is the number of edges denoting equivalent
relations.

Fig. 2. Examples of preference graphs, panels (a) and (b), with top layer $V^+$ and bottom layer $V^-$.

Consider the action's influence on the reward function. Here we define $r$ as follows.

$$r = (\underbrace{r_{a_1}(s_1), \ldots, r_{a_1}(s_h)}_{r_{a_1}}, \ldots, \underbrace{r_{a_m}(s_1), \ldots, r_{a_m}(s_h)}_{r_{a_m}}) \tag{4}$$

where $r_{a_j}$, $\forall j \in \{1, 2, \ldots, m\}$, denotes the reward
associated only with the $j$-th action. Given $r$, a ranking function can be
naturally formulated as an arrangement of the nodes obtained by sorting
the values of the Q-functions. We write the ranking function
with respect to a node $u$ at state $s$ as $Q(s, \psi(u))$.

A. Bayesian inference 

Below we describe our models for prior information, 
likelihood functions, and inference. 

1) Gaussian prior: Consider $r_{a_j}$ as a stochastic process.
Then $r_{a_j}$ is a Gaussian process if, for any $\{s_1, \ldots, s_h\} \in S$,
the random variables $\{r_{a_j}(s_1), \ldots, r_{a_j}(s_h)\}$ are jointly
normally distributed. We denote by $k_{a_j}(s_c, s_d)$ the function generating
the value of entry $(c, d)$ for the covariance matrix $K_{a_j}$, which
leads to $r_{a_j} \sim N(0, K_{a_j})$. Then the joint prior probability
of the reward is a product of multivariate Gaussians, namely
$p(r \mid S) = \prod_{j=1}^{m} p(r_{a_j} \mid S)$ and $r \sim N(0, K)$. Thus $r$
is completely specified by the positive definite covariance
matrix $K$. As we assume the $m$ latent processes are uncorrelated,
the covariance matrix $K$ is block diagonal in
the covariance matrices $\{K_{a_1}, \ldots, K_{a_m}\}$. In practice, we use a
squared exponential kernel function, written as

$$k_{a_j}(s_c, s_d) = \exp\left(-\frac{1}{2}(s_c - s_d)^T M_{a_j} (s_c - s_d)\right) + \sigma_{a_j}^2\, \delta(s_c, s_d),$$

where $M_{a_j} = \kappa_{a_j} I_h$ and $I_h$ is an identity matrix of size $h$.
The function $\delta(\cdot)$ is the Kronecker delta.
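The construction of the per-action covariance matrices and the block-diagonal prior covariance can be sketched as below. The sketch assumes the standard squared exponential form with an isotropic scale `kappa`, distinct training states (so the Kronecker delta contributes only on the diagonal), and illustrative function names.

```python
import numpy as np

def se_kernel_matrix(S, kappa, sigma):
    """Covariance matrix for one action.

    Entry (c, d) is exp(-0.5 * kappa * ||s_c - s_d||^2) + sigma^2 * delta(c, d).
    """
    S = np.asarray(S, dtype=float)
    sq_dist = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * kappa * sq_dist) + sigma ** 2 * np.eye(len(S))

def block_diagonal(blocks):
    """Assemble K from per-action covariance matrices (uncorrelated processes)."""
    size = sum(b.shape[0] for b in blocks)
    K = np.zeros((size, size))
    offset = 0
    for b in blocks:
        k = b.shape[0]
        K[offset:offset + k, offset:offset + k] = b
        offset += k
    return K
```

Because the off-diagonal blocks are zero, the prior factorizes across actions and each block can be inverted separately.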

2) Likelihood: Given an edge $u \to v$, we adopt a variant
of the likelihood function proposed by Chu and Ghahramani
in [10] to capture the preference relation in that edge.
Specifically,

$$p_{ideal}(u \to v \mid r_{\psi(u)}(s), r_{\psi(v)}(s)) = \begin{cases} 1, & \text{if } Q(s, \psi(u)) > Q(s, \psi(v)) \\ 0, & \text{otherwise,} \end{cases} \tag{5}$$

where $u$ and $v$ are two nodes in the preference graph. By
Definition 2, these nodes can be mapped to two actions $\psi(u)$
and $\psi(v)$ in the space $A$. We write the Q-function as

$$Q(s, a) = r_a(s) + \gamma P_{as} (I_h - \gamma P_{a^*})^{-1} \hat{I} r, \tag{6}$$

where $P_{as}$ and $P_{a^*}$ are transition probabilities for the
observed $n$ states, and $\hat{I}$ is a matrix with $h$ rows and
$h \times m$ columns. The product of $\hat{I}$ and $r$ is an $n \times 1$
vector containing the reward for taking the optimal action
at each state. After assuming that the latent functions are
contaminated with Gaussian noise that has zero mean and
unknown variance $\sigma^2$ [10], the likelihood function for the $l$-th
strict preference edge in graph $G_i$ becomes

$$p(u_l \to v_l \mid r) = \int\!\!\int p_{ideal}(u_l \to v_l \mid r_{\psi(u_l)}(s_i) + \delta_u, r_{\psi(v_l)}(s_i) + \delta_v)\, N(\delta_u; 0, \sigma^2)\, N(\delta_v; 0, \sigma^2)\, d\delta_u\, d\delta_v = \Phi(z_l^i), \tag{7}$$

where $z_l^i = \frac{Q(s_i, \psi(u_l)) - Q(s_i, \psi(v_l))}{\sqrt{2}\sigma}$,
$N(\delta_u; 0, \sigma^2)$ denotes a Gaussian distribution for $\delta_u$, and
$\Phi(z) = \int_{-\infty}^{z} N(\gamma; 0, 1)\, d\gamma$.
The $l$-th edge $(u_l \to v_l)$ in the preference graph $G_i$ denotes
the strict preference relation $\psi(u_l) \succ_{s_i} \psi(v_l)$. Consequently,
we have $p(\psi(u_l) \succ_{s_i} \psi(v_l) \mid r) = \Phi(z_l^i)$. With a two-layer
preference graph, we are only interested in the directed edges
between two layers as well as the equivalent relation in the
top layer. We propose a new likelihood function for the $k$-th
equivalent preference edge as follows,

$$p(u_k \leftrightarrow v_k \mid r) \propto \exp\left(-\frac{1}{2}\big(Q(s_i, \psi(u_k)) - Q(s_i, \psi(v_k))\big)^2\right), \tag{8}$$

where $u_k, v_k \in V_i^+$ and the $k$-th edge $(u_k \leftrightarrow v_k)$ denotes the
equivalent relation $\psi(u_k) \sim_{s_i} \psi(v_k)$. We have
$p(\psi(u_k) \sim_{s_i} \psi(v_k) \mid r) = p(u_k \leftrightarrow v_k \mid r)$, which is
shown in Eq. (8). Then we compute the likelihood function for all
observed preference graphs using the following equation,



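$$p(\mathcal{G} \mid S, r, \theta) = \prod_{i=1}^{n} p(G_i \mid s_i, r) \propto \prod_{i=1}^{n} \prod_{l=1}^{n_i} \Phi(z_l^i)\, \exp\left(\sum_{i=1}^{n} \sum_{k=1}^{m_i} -\frac{1}{2}\big(Q(s_i, \psi(u_k)) - Q(s_i, \psi(v_k))\big)^2\right). \tag{9}$$

We put all the unknown parameters into a hyper-parameter
vector $\theta = \{\kappa_{a_j}, \sigma_{a_j}, \sigma\}$, and then adjust the
hyper-parameters on the basis of maximum a posteriori estimation.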
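The edge likelihoods of Eqs. (7) and (8) and the log of their product in Eq. (9) can be evaluated with the standard library alone. The sketch below assumes the probit argument $z = (Q_u - Q_v)/(\sqrt{2}\sigma)$ as in Chu and Ghahramani [10], and it takes Q-values as plain numbers that have already been computed; the function names and the dictionary layout of a graph are illustrative.

```python
import math

def std_normal_cdf(z):
    """Phi(z), the standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def strict_edge_likelihood(q_u, q_v, sigma):
    """Probability that u is strictly preferred to v under Gaussian noise."""
    z = (q_u - q_v) / (math.sqrt(2.0) * sigma)
    return std_normal_cdf(z)

def equivalent_edge_likelihood(q_u, q_v):
    """Likelihood of an equivalence edge, up to a normalizing constant."""
    return math.exp(-0.5 * (q_u - q_v) ** 2)

def log_likelihood(graphs, sigma):
    """Unnormalized log-likelihood over all graphs.

    Each graph is a dict with 'strict' and 'equivalent' lists of
    (q_u, q_v) pairs of Q-values for the two endpoints of an edge.
    """
    total = 0.0
    for g in graphs:
        for q_u, q_v in g["strict"]:
            total += math.log(strict_edge_likelihood(q_u, q_v, sigma))
        for q_u, q_v in g["equivalent"]:
            total += -0.5 * (q_u - q_v) ** 2
    return total
```

A strict edge whose endpoints have equal Q-values receives likelihood one half, and the value approaches one as the preferred action's advantage grows relative to the noise level.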

3) Posterior inference: Here we adopt a hierarchical
model. At the lowest level are function values encoded as a
parameter vector $r$. At the top level are hyper-parameters in
$\theta$ controlling the distribution of the parameters at the bottom
level. Inference takes place one level at a time. At the bottom
level, the posterior over function values is given by Bayes'
rule as $p(r \mid S, \mathcal{G}, \theta) = p(\mathcal{G} \mid S, \theta, r)\, p(r \mid S, \theta) / p(\mathcal{G} \mid S, \theta)$.

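The posterior combines the information from the prior
and the data, which reflects the updated belief about $r$ after
observing the decisions. By Eq. (4), our task is to minimize
the negative log posterior $U(r)$, which is

$$U(r) = \frac{1}{2} \sum_{j=1}^{m} r_{a_j}^T K_{a_j}^{-1} r_{a_j} + \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{m_i} \big(Q(s_i, \psi(u_k)) - Q(s_i, \psi(v_k))\big)^2 - \sum_{i=1}^{n} \sum_{l=1}^{n_i} \ln \Phi(z_l^i). \tag{10}$$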



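Given the $k$-th equivalent relation $\psi(u_k) \sim_{s_i} \psi(v_k)$, let
$\Delta_k = \gamma (P_{\psi(u_k) s_i} - P_{\psi(v_k) s_i})(I_h - \gamma P_{a^*})^{-1}$; then we have

$$\frac{\partial \big[Q(s_i, \psi(u_k)) - Q(s_i, \psi(v_k))\big]}{\partial r_{a_j}} = \alpha_i \big[\mathbf{1}(a_j = \psi(u_k)) - \mathbf{1}(a_j = \psi(v_k))\big] + \Delta_k \hat{I}_{a_j}$$

and

$$\frac{\partial^2 \big[Q(s_i, \psi(u_k)) - Q(s_i, \psi(v_k))\big]}{\partial r_{a_j}\, \partial r_{a_j}} = 0,$$

where $\hat{I}_{a_j}$ is a block matrix of $\hat{I} = [\hat{I}_{a_1}, \hat{I}_{a_2}, \ldots, \hat{I}_{a_m}]$, and
$\alpha_i$ is a $1 \times h$ vector whose entries are $\alpha_i(i) = 1$ and $\alpha_i(j) = 0$, $\forall j \neq i$.
The notation $\mathbf{1}(\cdot)$ is an indicator function.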

Remark Minimizing Eq. (10) is a convex optimization problem.
The proof can be found in our supplemental report [12].

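At the minimum of $U(r)$ we have

$$\frac{\partial U}{\partial r_{a_j}} = 0 \ \Rightarrow \ \hat{r}_{a_j} = K_{a_j} \big(\nabla \log P(\mathcal{G} \mid S, \hat{r}, \theta)\big), \tag{11}$$

where $\hat{r} = (\hat{r}_{a_1}, \ldots, \hat{r}_{a_j}, \ldots, \hat{r}_{a_m})$ and the gradient is
taken with respect to $r_{a_j}$. In Eq. (11) we can use
Newton's method to find the minimum of $U$ with the
iteration

$$r_{a_j}^{new} = r_{a_j} - \left(\frac{\partial^2 U}{\partial r_{a_j}\, \partial r_{a_j}}\right)^{-1} \frac{\partial U}{\partial r_{a_j}}.$$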
B. Model selection 

Model selection is the process of choosing a covariance
function for a Gaussian process. The process can be considered
to be training of a Gaussian process [13]. At the top
level, we can optimize the hyper-parameters by maximizing
the posterior over these hyper-parameters. The posterior
$p(\theta \mid \mathcal{G}, S)$ is given by
$p(\theta \mid \mathcal{G}, S) = p(\mathcal{G} \mid S, \theta)\, p(\theta) / p(\mathcal{G} \mid S)$,
where the normalizing constant can be omitted to simplify
the optimization problem. If the prior distribution
of hyper-parameters has no population basis, we assign
a non-informative prior density to $\theta$. Optimization over $\theta$
then becomes the problem of maximizing the marginal likelihood
$p(\mathcal{G} \mid S, \theta)$. We approximate the integral defining the marginal
likelihood $p(\mathcal{G} \mid S, \theta)$ using a Laplace approximation, a local
expansion around the maximum, which is written as

$$p(\mathcal{G} \mid S, \theta) \approx p(\mathcal{G} \mid S, \hat{r}, \theta) \times p(\hat{r} \mid \theta)\, \delta_{r \mid S}, \tag{12}$$

where $\delta_{r \mid S} = |-\nabla\nabla \ln P(r \mid \mathcal{G}, S, \theta)|^{-1/2}$ is the posterior
uncertainty in $r$, which is known as the Occam factor and
automatically incorporates a trade-off between model fit
and model complexity. As the number of data points increases, the
approximation is expected to become increasingly accurate.
The marginal likelihood can be further written as

$$\log p(\mathcal{G} \mid S, \theta) = -U(\hat{r}) - \frac{1}{2} \log |I + K\Omega|, \tag{13}$$

where $\hat{r}$ is the MAP estimate in Eq. (11) and $\Omega$ is the second
derivative matrix of the sum of the second and third parts in
Eq. (10). Now we can find the optimal hyper-parameters by
maximizing Eq. (13).
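Once the MAP estimate and the curvature matrix are available, the approximate log marginal likelihood of Eq. (13) is a short computation. The sketch below assumes that the value $U(\hat{r})$, the prior covariance `K`, and the second-derivative matrix `Omega` have been computed elsewhere; the function name is illustrative.

```python
import numpy as np

def laplace_log_evidence(U_hat, K, Omega):
    """Return -U(r_hat) - 0.5 * log|I + K Omega|."""
    n = K.shape[0]
    sign, logdet = np.linalg.slogdet(np.eye(n) + K @ Omega)
    if sign <= 0:
        raise ValueError("I + K Omega must have a positive determinant")
    return -U_hat - 0.5 * logdet
```

Using `slogdet` rather than `log(det(...))` avoids overflow or underflow of the determinant when the number of latent values is large. A hyper-parameter search would call this function once per candidate $\theta$, after recomputing the MAP estimate for that candidate.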



Fig. 3. Average accuracy as a function of the number of observed decision
trajectories, for GridWorld experiments: (a) GPIRL accuracy; (b) LIRL accuracy.



C. Posterior predictive reward 

When the observed state-action pairs are limited, e.g.,
in a large or infinite state space, it is desirable to
predict the reward at new states. Our IRL with
Gaussian processes provides a probabilistic model to predict
the reward at a newly encountered state $s^*$, which is a Gaussian model
$p(r^* \mid \mathcal{G}, S, s^*, \theta)$ with the following mean function

$$E(r^*_{a_j} \mid \mathcal{G}, S, s^*, \theta) = k_{a_j}(S, s^*)^T (K_{a_j} + \sigma^2 I)^{-1} \hat{r}_{a_j}$$

and covariance function

$$\mathrm{cov}(r^*_{a_j} \mid \mathcal{G}, S, s^*, \theta) = k_{a_j}(s^*, s^*) - k_{a_j}(S, s^*)^T (K_{a_j} + \sigma^2 I)^{-1} k_{a_j}(S, s^*),$$

where $k_{a_j}(S, s^*)$ is the vector of covariances between the test
point and the training points for the covariance function relating
to the action $a_j \in A$.
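The predictive mean and variance for a single action can be sketched with NumPy as follows. The sketch assumes a noise-free squared exponential kernel with isotropic scale `kappa` for $K_{a_j}$, with the noise variance added separately as $\sigma^2 I$; the function names and argument layout are illustrative.

```python
import numpy as np

def se_kernel(A, B, kappa):
    """Squared exponential covariances between the rows of A and the rows of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * kappa * sq_dist)

def predict_reward(S, r_hat, s_star, kappa, sigma):
    """Posterior predictive mean and variance of the reward at state s_star."""
    K = se_kernel(S, S, kappa)
    k_star = se_kernel(S, [s_star], kappa)[:, 0]
    A = K + sigma ** 2 * np.eye(len(S))
    mean = k_star @ np.linalg.solve(A, r_hat)
    prior_var = se_kernel([s_star], [s_star], kappa)[0, 0]
    var = prior_var - k_star @ np.linalg.solve(A, k_star)
    return mean, var
```

Far from every training state the mean reverts to the zero prior mean and the variance to the prior variance, which is how the model signals that a prediction is an extrapolation.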

V. Experiments 

In this section, we report on a simple GridWorld experiment
in which an agent starts from a square of the
grid and attempts to navigate to the goal square, with the 
possibility of encountering obstacles that block movement 
to certain squares. The agent is able to take five actions: 
remaining in the current square or moving in one of the 
four cardinal directions. Each movement action results in 
movement in the intended direction with probability 0.65, 
movement in an unintended direction with probability 0.2, 
and failure to move with probability 0.15. 
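The movement model can be written down as a small transition function. The text does not say how the 0.2 probability is divided among unintended directions or what happens at walls, so the sketch below assumes an even split over the three other directions, a deterministic "stay" action, and that a blocked move leaves the agent in place; these are assumptions for illustration only.

```python
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def transition_distribution(state, action, size, obstacles=frozenset()):
    """Return {next_state: probability} for one (state, action) pair.

    `state` is a (row, col) tuple on a size x size grid.
    """
    if action == "stay":
        return {state: 1.0}
    probs = {state: 0.15}                       # failure to move
    others = [a for a in MOVES if a != action]
    weighted = [(action, 0.65)] + [(a, 0.2 / 3) for a in others]
    for a, p in weighted:
        dr, dc = MOVES[a]
        nxt = (state[0] + dr, state[1] + dc)
        inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
        if not inside or nxt in obstacles:
            nxt = state                         # blocked: remain in place
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```

Evaluating this function for every state and action yields the rows of the transition matrices needed by the IRL formulations in Sections III and IV.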

We compared three algorithms: our convex programming
method from Section III (CPIRL), our Gaussian process
method from Section IV (GPIRL), and the linear approximation
method in [2] (LIRL). Given observation of a complete
policy, each of the algorithms was successful in finding a
reward vector that yields an optimal policy identical to that
observed. For each of the reward vectors returned by the
algorithms, we recorded the amount of computation time
needed to find a best policy using reinforcement learning.
Table I shows the average of these times over 50 simulations.
Notably, reinforcement learning converges more quickly with
reward vectors returned by CPIRL and GPIRL than with
those returned by LIRL. We hypothesize that our methods tend
to shape the reward, providing additional feedback to the agent
and leading to an improvement in learning rate.




Fig. 4. Solutions to the hill climbing problem based on the true reward (blue)
and the reward recovered from GPIRL (red), for two levels of discretization:
(a) 60-state discretization; (b) 120-state discretization.

Fig. 3 provides the basis for an accuracy comparison of
GPIRL and LIRL for experiments in which only partial
observations were available for reward learning. Accuracy is
calculated as the fraction of runs in which the apprentice
is able to achieve the teacher's goal state. The process of
computing accuracy includes: 1) generating some GridWorld
problems and sampling the decision trajectories from the
teacher's demonstration; 2) inferring the reward function
using GPIRL and LIRL; 3) generating 1000 new GridWorld
problems with random initial states and solving these problems
by applying reinforcement learning using the reward
output by IRL; 4) comparing the results of the GPIRL and
LIRL apprentices with the teacher. If the apprentice reaches
the teacher's goal state, we consider that trial a success for
the apprentice. As can be seen in Fig. 3, the accuracy of
GPIRL is higher than that of LIRL, especially when the
number of observations is small. Additionally, GPIRL has
clearly lower variance in accuracy.

TABLE I

TIME (SEC) TO FIND THE APPRENTICE POLICY

GridWorld Size    LIRL     CPIRL    GPIRL
10x10             2.61     2.06     1.20
20x20             20.05    15.75    9.32
30x30             75.12    64.30    35.11

We also performed an experiment based on a simulation
of an under-powered car attempting to drive out of a U-shaped
valley. In this simulation, the car lacks enough power
to climb the valley slopes from a standstill. Instead, it must
first reverse up a slope in order to accumulate energy that
will help it rush up the opposite slope. We choose the car's
position and velocity as state features, discretizing those
naturally continuous quantities. To test GPIRL's ability to
predict the reward on unseen states, we sampled only half the
discretized states as the observation data for GPIRL. Given a
state space with 120 states, for example, we would observe
behavior in only 60 states. Fig. 4 shows the number of
steps needed to escape the valley for a range of starting
conditions, or episodes, for policies learned from the true
reward (blue) and from the reward returned by GPIRL (red).
The results in the figure suggest that GPIRL is able to
effectively recover the reward with incomplete observations,
since the solver, using the reward predicted by GPIRL,
performs on par with the teacher, which uses the true
reward.

VI. Conclusions 

We propose new IRL algorithms in the domain of convex
programming. To deal with the ill-posed nature of IRL problems
in large (or even infinite) state spaces, we model the
reward using a Gaussian process and interpret the observations
of state-action pairs using preference graphs. Our posterior
prediction method can estimate the reward at unobserved
states, which is promising for problems with
large state spaces. Numerical experiments suggest that our
method is able to find a reward approaching the true
underlying reward with fewer observations than are needed
with standard approaches. We will continue our research on
IRL with Gaussian processes in continuous spaces.